COURSES

Introduction

Welcome to LADAL Courses. This page provides structured learning pathways for students and researchers who want to apply computational and quantitative methods to the study of language and the humanities. Whether you are a complete beginner picking up R for the first time or an experienced analyst looking to extend your skills into natural language processing or advanced statistics, there is a pathway here for you.

What Are LADAL Courses?

LADAL Courses are curated sequences of tutorials, readings, and practical exercises designed to help learners progress systematically from foundational knowledge to more advanced skills, using free, open, and reproducible tools. All courses are built around resources from the Language Technology and Data Analysis Laboratory (LADAL) and use R as the primary analysis environment.

Courses are available in two formats:

  • Short Courses — self-paced, independent-study sequences of 6–10 tutorials covering one focused topic. Ideal for researchers who want to build a specific skill quickly, or for instructors looking for a compact module to embed in a larger course.
  • Long Courses — structured 12-week, semester-length courses with weekly lecture topics, LADAL tutorials, and recommended readings. Designed to serve as a complete scaffold for university courses, or for motivated independent learners who want a thorough grounding in a field.

By following a LADAL course, learners will be able to:

  • Understand the key concepts and principles behind quantitative research and computational text analysis
  • Develop practical skills in R, including data management, visualisation, statistics, and text analytics
  • Apply statistical and computational methods to real-world research questions in linguistics, the humanities, and the social sciences
  • Build reproducible and transparent research workflows, ensuring robust and reliable analyses

Who Is This For?

The table below gives a quick overview of each course’s intended audience and assumed background. Courses are listed in approximate order from most introductory to most advanced.

Course overview by audience and background

Course | Format | Audience | Assumed background
Introduction to Language Technology | Short | Linguists and humanities students curious about language technology | None
Intro to Corpus Linguistics | Short | Linguistics students; language teachers and researchers | None
Intro to Text Analysis | Short | Humanities and social science students | None
Data Visualisation for Linguists | Short | Linguists and language researchers | Basic R helpful
Introduction to Statistics | Short | Humanities and social science students | None
Introduction to Learner Corpus Research | Short | Applied linguists; SLA researchers; language teachers | Basic corpus linguistics helpful
Introduction to Digital Humanities with R | Long | Humanities researchers and students | None
Intro to Corpus Linguistics and Text Analysis | Long | Linguistics and applied linguistics students | None
Introduction to Statistics | Long | Humanities and social sciences students and researchers | None
Natural Language Processing with R | Short | Computational linguists; data scientists working with language | Intermediate R; basic statistics
Advanced Statistics | Long | Researchers with prior statistics knowledge | Basic statistics and R

Short Courses

Short courses are designed for independent learners and can be worked through at your own pace. They cover some of LADAL’s most popular topics and consist of 6–10 tutorials, organised in a logical sequence from foundational to applied.


Introduction to Language Technology

Duration: 6 tutorials · Self-paced
Audience: Anyone curious about how computers process and analyse language — no prior programming or linguistics background required
Aim: Provide a conceptual and practical first introduction to language technology: what it is, what it can do, and how to get started using freely available tools in R

Language technology encompasses the computational tools and methods used to analyse, generate, and interact with human language. This short course introduces learners to the landscape of language technology — from basic text processing and regular expressions to corpus tools, OCR, and an overview of modern NLP — with hands-on practice in R. By the end, learners will understand the key methods and be equipped to explore more specialised pathways.

What you will gain:

  • A conceptual map of language technology and its applications in linguistics and the humanities
  • Practical experience with text data in R: loading, cleaning, and exploring text
  • Familiarity with regular expressions as a foundation for all text-analytic work
  • Hands-on experience with OCR for converting PDFs and scanned documents to text
  • An understanding of how corpus tools and NLP pipelines are constructed

1. Introduction to Text Analysis
An overview of the field: what text analysis is, how it relates to corpus linguistics and NLP, and what kinds of research questions it can address. Introduces key concepts including corpus, token, type, frequency, and concordance.

2. Getting Started with R
Your first introduction to R and RStudio, covering the essential operations needed for text analysis: installing packages, loading data, working with vectors and data frames, and writing simple functions.

3. Loading and Saving Data
How to import text data into R from a variety of formats — plain text files, CSV, Excel, and web URLs — and how to save results for later use.

4. String Processing
An introduction to working with character strings in R using stringr. Covers pattern matching, substitution, splitting, and the basic string operations that underpin all subsequent text analysis.

5. Regular Expressions
A practical introduction to regular expressions (regex) — the pattern language used to search, extract, and transform text. Covers character classes, quantifiers, anchors, and look-arounds with worked linguistic examples.

6. Converting PDFs to Text
How to extract machine-readable text from PDF documents using pdftools (for digitally generated PDFs) and tesseract (for scanned documents), including post-OCR spell-checking workflows.
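The string-processing and regex steps in tutorials 4 and 5 can be sketched in a few lines of base R. The tutorials themselves use stringr, whose functions (str_detect(), str_replace_all(), str_extract()) mirror the base equivalents shown here; the example texts are invented for illustration:

```r
# Toy texts (invented for illustration)
texts <- c("The cat sat.", "A dog ran!", "Cats and dogs.")

# Pattern matching: which strings contain 'cat' or 'Cat'?
has_cat <- grepl("[Cc]at", texts)

# Substitution: strip punctuation
clean <- gsub("[[:punct:]]", "", texts)

# Splitting lower-cased text into word tokens
tokens <- unlist(strsplit(tolower(clean), "\\s+"))

# Extraction: pull every token ending in 's'
plurals <- regmatches(tokens, regexpr("\\w+s$", tokens))
```

These four operations (detect, replace, split, extract) underpin virtually every text-preparation step in the later tutorials.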


Introduction to Corpus Linguistics

Duration: 7 tutorials · Self-paced
Audience: Linguistics students; language teachers; researchers new to corpus methods
Aim: Introduce the core methods of corpus linguistics — concordancing, collocations, keyness analysis, and frequency-based exploration — using R and reproducible workflows

Corpus linguistics uses large, principled collections of authentic text (corpora) to investigate patterns of language use. This short course takes learners from a conceptual introduction to the field through hands-on practice with the most widely used corpus methods, culminating in a case-study showcase that demonstrates how the individual techniques combine into a full corpus-based analysis.

What you will gain:

  • A clear understanding of what a corpus is and how corpus-based research differs from introspective approaches
  • Practical skills in frequency analysis, concordancing, collocation identification, and keyword extraction using R
  • The ability to design, conduct, and report a reproducible corpus-based study
  • Familiarity with key R packages for corpus linguistics: quanteda, tidytext, and related tools

1. Introduction to Text Analysis
Introduces corpus linguistics and text analysis as fields, defining key concepts — corpus, concordance, collocation, keyword, frequency — that are used throughout the course.

2. Getting Started with R
First introduction to R and RStudio. Focus on the first four sections (up to Working with Tables) for the purposes of this course.

3. String Processing
Essential string manipulation skills for handling raw text data in R: pattern matching, substitution, tokenisation preparation, and whitespace management.

4. Concordancing (Keywords-in-Context)
How to search a corpus for words and phrases and display the results as KWIC (keyword-in-context) concordances using R. Covers sorting, filtering, and interpreting concordance output.

5. Collocation and N-gram Analysis
How to identify statistically significant word collocations and extract n-gram sequences from a corpus, including association measures (PMI, log-likelihood, t-score) and visualisation.

6. Keyness and Keyword Analysis
How to compare two corpora and identify the words that are statistically more or less frequent in one relative to the other — the foundation of contrastive corpus analysis.

7. Corpus Linguistics with R
A capstone showcase presenting complete case studies that integrate concordancing, frequency analysis, collocations, and keyness into full corpus-based research workflows.
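As a taste of the concordancing method at the heart of this course, a minimal keyword-in-context extractor fits in a few lines of base R. Packages such as quanteda provide a full-featured kwic() for real work; the toy sentence below is invented for illustration:

```r
# Toy corpus (invented for illustration)
text   <- "the cat sat on the mat and the dog sat on the rug"
tokens <- strsplit(text, " ")[[1]]
node   <- "sat"   # the search word
window <- 2       # words of context on each side

# Find every occurrence of the node and assemble its context window
hits <- which(tokens == node)
kwic <- sapply(hits, function(i) {
  left  <- paste(tokens[max(1, i - window):(i - 1)], collapse = " ")
  right <- paste(tokens[(i + 1):min(length(tokens), i + window)], collapse = " ")
  paste(left, "[", node, "]", right)
})
```

The result is one aligned context line per occurrence, which is exactly what a concordance display sorts and filters.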


Introduction to Text Analysis

Duration: 7 tutorials · Self-paced
Audience: Humanities and social science students; researchers wanting computational approaches to text
Aim: Take learners from a conceptual introduction to text analysis through to applied techniques including topic modelling, sentiment analysis, and network analysis

Text analysis uses computational methods to extract patterns, topics, sentiment, and relational structure from large collections of text. This course introduces the key methods in sequence, building from foundational R skills and string processing to more sophisticated analyses. By the end, learners will be able to apply a range of text-analytic methods to their own research texts.

What you will gain:

  • An understanding of the major families of computational text analysis and their research applications
  • Practical R skills for cleaning, processing, and analysing text data
  • Hands-on experience with topic modelling, sentiment analysis, and network analysis
  • The ability to select the most appropriate method for a given research question

1. Introduction to Text Analysis
Overview of the field and key concepts. Situates text analysis within broader computational humanities and social science research.

2. Getting Started with R
First introduction to R and RStudio. Focus on the first four sections for the purposes of this course.

3. String Processing
Core string manipulation skills for preparing raw text for analysis.

4. Practical Overview of Selected Text Analytics Methods
A hands-on introduction to common text analysis methods — frequency analysis, TF-IDF, basic classification — using the R skills built in the previous tutorials.

5. Topic Modelling
Introduces Latent Dirichlet Allocation (LDA) and related topic models for discovering thematic structure in document collections. Covers both theory and practical implementation in R.

6. Sentiment Analysis
Introduces lexicon-based and machine-learning approaches to extracting opinion and emotion from text. Covers dictionary methods, valence shifting, and visualisation.

7. Network Analysis
Introduces network (graph) analysis as a method for representing relational structure in textual and social data. Covers node and edge construction, centrality measures, and visualisation in R.
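To give a flavour of the lexicon-based approach covered in the sentiment analysis tutorial, here is a deliberately tiny sketch in base R. The four-word lexicon is invented for illustration; real analyses use curated lexicons such as AFINN or Bing:

```r
# Tiny illustrative sentiment lexicon (invented; real work uses curated lexicons)
lexicon <- c(good = 1, great = 2, bad = -1, awful = -2)

# Score a text by summing the valence of every lexicon word it contains
score_sentiment <- function(text) {
  words <- strsplit(tolower(gsub("[[:punct:]]", "", text)), "\\s+")[[1]]
  sum(lexicon[words], na.rm = TRUE)   # words outside the lexicon score NA
}

score_sentiment("A great film with a good script")   # positive score
score_sentiment("An awful plot and bad acting")      # negative score
```

Dictionary methods like this are fast and transparent; the tutorial also discusses their limits (negation, irony) and valence-shifting corrections.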


Data Visualisation for Linguists

Duration: 6 tutorials · Self-paced
Audience: Linguists and language researchers who want to communicate their findings more effectively; anyone who produces graphs and tables in R
Aim: Develop principled, publication-quality data visualisation skills using ggplot2 and related tools, with linguistic data as the running example

Effective visualisation is one of the most transferable skills in quantitative research. This course builds from the principles of good visualisation design through the mechanics of ggplot2, covering the graph types most commonly needed in linguistics and language research: frequency distributions, scatter plots, heat maps, geographic maps, and interactive visualisations. Special attention is given to colour accessibility, annotations, and formatting for publication.

What you will gain:

  • A principled understanding of what makes a graph effective or misleading
  • Practical skills with ggplot2: the grammar of graphics, geoms, scales, facets, themes, and annotations
  • The ability to produce publication-quality static and interactive visualisations from linguistic data
  • Confidence in choosing the right graph type for the right data and research question

1. Getting Started with R
Introduction to R and RStudio with a focus on the data structures and workflow needed for visualisation.

2. Introduction to Data Visualisation
Introduces visualisation philosophy, perceptual principles, and the grammar of graphics. Covers when to use which chart type and common pitfalls.

3. Descriptive Statistics
Covers the summary statistics — means, medians, distributions, variance — that underpin most visualisations of linguistic data.

4. Data Visualisation with R
In-depth ggplot2 tutorial covering the most important geoms for linguistic data: histograms, density plots, box plots, bar charts, scatter plots, and line graphs, with worked examples from corpus and experimental linguistics.

5. Visualising and Analysing Survey Data
Covers visualisation methods specific to Likert-scale and questionnaire data: cumulative density plots, diverging stacked bar charts, and the likert package.

6. Maps and Spatial Visualisation
Introduces geographic visualisation of linguistic data — dialect maps, distribution maps, and choropleth maps — using ggplot2 and sf.
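The frequency plots at the heart of this course can be prototyped even in base R graphics; the tutorials build the same plots with ggplot2, which adds scales, facets, and themes. The word counts below are invented for illustration:

```r
# Illustrative word counts (invented)
word_freq <- c(the = 120, of = 70, and = 65, to = 50, a = 45)

# Draw to an off-screen PNG so the sketch also runs headless
png(tempfile(fileext = ".png"))
barplot(word_freq,
        main = "Top five word frequencies",
        ylab = "Frequency (tokens)")
dev.off()

# Relative frequencies (per 100 tokens), as you would report them
rel_freq <- round(100 * word_freq / sum(word_freq), 1)
```

Computing the summary first and plotting it second is the workflow the ggplot2 tutorials follow as well: the data frame carries the numbers, the plot only presents them.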


Introduction to Statistics in the Humanities and Social Sciences

Duration: 7 tutorials · Self-paced
Audience: Humanities and social science students and researchers with little or no prior knowledge of statistics
Aim: Build statistical literacy and practical quantitative skills from the ground up, using R as the analysis environment throughout

This course provides a conceptual and practical introduction to statistics for researchers whose background is in the humanities or social sciences. It begins with the philosophical foundations of quantitative reasoning and builds systematically through descriptive statistics, visualisation, and inferential testing. No prior statistical knowledge is assumed; by the end, learners will be able to conduct and interpret basic statistical analyses and communicate their results clearly.

What you will gain:

  • A solid conceptual understanding of statistical thinking, probability, and hypothesis testing
  • Practical skills in R for summarising, tabulating, visualising, and testing data
  • The ability to select, apply, and interpret the most common inferential tests (t-tests, chi-square, correlation, simple regression)
  • Confidence in reading and critically evaluating quantitative results in published research

1. Introduction to Quantitative Reasoning
A conceptual introduction to scientific thinking, the logic of hypothesis testing, and the role of quantitative methods in humanities and social science research.

2. Basic Concepts in Quantitative Research
Defines the core concepts of statistics: variables, data types, sampling, populations, reliability, and validity.

3. Getting Started with R
Introduction to R and RStudio. Focus on the first four sections for the purposes of this course.

4. Handling Tables in R
How to work with tabular data in R: importing, cleaning, reshaping, and summarising data frames using dplyr and tidyr.

5. Descriptive Statistics
How to summarise and describe data numerically and visually: means, medians, standard deviations, distributions, and frequency tables.

6. Introduction to Data Visualisation
Principles of data visualisation and hands-on practice creating and customising graphs in R.

7. Basic Inferential Statistics
Introduction to hypothesis testing, p-values, confidence intervals, t-tests, chi-square tests, correlation, and simple linear regression — with practical exercises in R throughout.
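The tests listed above are all one-liners in base R. The sketch below runs each on simulated data; the group means, standard deviations, and sample sizes are invented for illustration:

```r
# Simulated data (invented parameters, fixed seed for reproducibility)
set.seed(42)
groupA <- rnorm(30, mean = 100, sd = 15)   # e.g. scores in condition A
groupB <- rnorm(30, mean = 110, sd = 15)   # e.g. scores in condition B

t_res   <- t.test(groupA, groupB)                           # two-sample t-test
chi_res <- chisq.test(matrix(c(30, 10, 20, 25), nrow = 2))  # 2x2 chi-square
cor_res <- cor.test(groupA, groupB)                         # Pearson correlation
lm_res  <- lm(groupB ~ groupA)                              # simple linear regression
```

Each result object prints a p-value and confidence interval; the tutorial focuses on how to choose among these tests and how to interpret and report their output.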


Introduction to Learner Corpus Research

Duration: 7 tutorials · Self-paced
Audience: Applied linguists; SLA researchers; language teachers and test developers; corpus linguists with an interest in learner language
Aim: Introduce the methods and concepts of Learner Corpus Research (LCR) using R, from corpus construction and basic frequency analysis through to lexical diversity, readability, and error analysis

Learner corpus research uses collections of authentic language produced by second-language (L2) learners to investigate the structure, development, and distinctiveness of interlanguage. This course introduces learners to the major analytical methods used in LCR — concordancing, frequency comparison, collocation, POS tagging, lexical diversity, and error analysis — using the ICLE and LOCNESS corpora as running examples.

What you will gain:

  • An understanding of what learner corpora are, how they differ from native-speaker corpora, and how they are used in SLA research
  • Practical skills for comparing learner and native-speaker language quantitatively using R
  • Experience with lexical diversity measures, readability scores, and spelling error detection
  • The ability to design and conduct a basic learner corpus study and interpret its findings in the context of SLA theory

1. Introduction to Text Analysis
Overview of text analysis and corpus linguistics. Introduces the key concepts — corpus, frequency, concordance, collocation — that underpin learner corpus research.

2. Getting Started with R
Introduction to R and RStudio with a focus on the data structures and workflow needed for corpus analysis.

3. String Processing
Core string handling skills for working with raw learner corpus texts: cleaning, normalising, splitting, and extracting character patterns.

4. Concordancing (Keywords-in-Context)
How to extract and inspect KWIC concordances from learner texts, and how to use concordancing to investigate how learners use specific words or constructions.

5. Collocation and N-gram Analysis
How to identify and compare collocational patterns between learner and native-speaker corpora — a core method for studying collocational competence and L1 transfer effects.

6. Analysing Learner Language with R
A comprehensive tutorial covering the full range of learner corpus analysis methods: frequency comparison, POS tagging, lexical diversity, readability scores, and spelling error detection, with worked examples from ICLE and LOCNESS.

7. Keyness and Keyword Analysis
How to identify words that are systematically over- or under-used by learners relative to native-speaker norms — one of the most informative methods in learner corpus research.
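The over- or under-use comparison behind keyness analysis boils down to a log-likelihood (G2) calculation that fits in a few lines of base R. The counts below are invented for illustration:

```r
# Illustrative counts (invented) for one word in two corpora
a <- 120          # hits in the learner corpus
b <- 40           # hits in the reference corpus
c_size <- 100000  # learner corpus size in tokens
d_size <- 120000  # reference corpus size in tokens

# Expected frequencies under the null hypothesis of equal use
e1 <- c_size * (a + b) / (c_size + d_size)
e2 <- d_size * (a + b) / (c_size + d_size)

# Log-likelihood (G2); values above 3.84 are significant at p < .05 (df = 1)
g2 <- 2 * (a * log(a / e1) + b * log(b / e2))
overused <- a > e1 && g2 > 3.84   # TRUE: learners over-use this word
```

The keyness tutorial wraps this arithmetic in corpus-wide comparisons and adds effect-size measures alongside significance.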


Natural Language Processing with R

Duration: 7 tutorials · Self-paced
Audience: Computational linguists; data scientists working with language data; linguists wanting to apply NLP methods
Assumed background: Intermediate R skills; basic familiarity with statistics (descriptive statistics, simple regression)
Aim: Introduce the core methods of applied NLP in R, from text preprocessing and feature extraction through to classification, named entity recognition, and transformer-based text representations

Natural language processing (NLP) builds on corpus and statistical methods to develop computational pipelines for understanding and generating language at scale. This course introduces learners to the NLP workflow in R, using real linguistic datasets throughout. Topics progress from foundational text preprocessing and feature engineering to supervised classification, topic models, and an introduction to working with large language model (LLM) embeddings and APIs.

What you will gain:

  • A clear understanding of the NLP pipeline: from raw text to structured, analysable representations
  • Practical skills in text preprocessing, tokenisation, stopword removal, stemming, and lemmatisation
  • Experience building document-feature matrices and applying TF-IDF weighting
  • Hands-on practice with text classification, named entity recognition (NER), and dependency parsing
  • An introduction to word embeddings and transformer-based representations

1. Introduction to Text Analysis
Conceptual overview of the text analysis landscape, situating NLP within corpus linguistics and computational linguistics.

2. String Processing
Foundation string manipulation skills — essential for all preprocessing steps in NLP pipelines.

3. Regular Expressions
In-depth introduction to regex as the primary pattern-matching tool in text preprocessing and feature extraction.

4. Practical Overview of Selected Text Analytics Methods
Hands-on introduction to document-feature matrices, TF-IDF, and basic classification workflows in R.

5. Topic Modelling
Probabilistic topic models as an unsupervised NLP method for discovering thematic structure in large text collections.

6. Analysing Learner Language with R
A rich applied NLP example: POS tagging with udpipe, sequence analysis, and lexical diversity measures — all key NLP tasks applied to real corpus data.

7. Network Analysis
How to represent and analyse relational structure in language data using graph methods — applicable to semantic networks, co-occurrence graphs, and social networks of linguistic interaction.
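Several of the tutorials above revolve around the document-feature matrix. Here is a minimal base-R sketch of TF-IDF weighting on a three-document toy corpus (invented for illustration); quanteda's dfm() and dfm_tfidf() do the same at scale:

```r
# Toy corpus (invented for illustration)
docs <- c(d1 = "cats chase mice", d2 = "dogs chase cats", d3 = "mice eat cheese")
tok  <- strsplit(docs, " ")
vocab <- sort(unique(unlist(tok)))

# Term-frequency matrix: rows = documents, columns = vocabulary
tf <- t(sapply(tok, function(w) table(factor(w, levels = vocab))))

# Inverse document frequency, then the TF-IDF-weighted matrix
idf   <- log(length(docs) / colSums(tf > 0))
tfidf <- tf * rep(idf, each = nrow(tf))
```

Words that occur in every document get an IDF of zero and drop out of the weighted matrix, which is exactly why TF-IDF highlights document-distinctive vocabulary.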


Long Courses

Long courses are structured as 12-week, semester-length programmes. They are designed to serve as a complete scaffold for university courses — each week provides a lecture topic, one or more LADAL tutorials, and recommended readings from key texts in the field. Independent learners are welcome to follow along using the tutorials and reading lists as a structured self-study guide.

Each course is organised as a series of weekly sessions with:

  • Lecture topics outlining the main concepts and learning objectives for the week
  • LADAL tutorials providing step-by-step, hands-on exercises with reproducible R code
  • Recommended readings to reinforce and expand the theoretical and methodological foundations

Introduction to Digital Humanities with R

Duration: 12 weeks (1h lecture + 1.5h tutorial per week)
Audience: Students and researchers in the humanities — literature, history, cultural studies, linguistics, media studies — who want to develop computational skills for digital research
Assumed background: None — no programming or statistics experience required
Aim: Introduce digital humanities methods and tools using R, moving from foundational data literacy and text processing through to corpus analysis, visualisation, network analysis, and a final project

Digital humanities applies computational methods to humanistic inquiry: analysing large literary corpora, mapping cultural data geographically, tracing discourse patterns across historical archives, or modelling networks of social interaction. This 12-week course introduces students to the core DH toolkit through the lens of R, with weekly hands-on tutorials grounded in real humanities datasets. No prior programming experience is assumed; by the end of the course, students will be able to design, conduct, and communicate a reproducible computational analysis of a humanities dataset.

Week 1: What Is Digital Humanities?
  • Lecture: Overview of digital humanities: history, debates, and current landscape; relationship to corpus linguistics, text analysis, and data science; what counts as DH research
  • LADAL content: Introduction to Text Analysis
  • Additional Readings:
    Burdick, A., et al. (2012). Digital humanities. MIT Press, Ch. 1
    Drucker, J. (2021). The digital humanities coursebook. Routledge, Ch. 1
Week 2: Reproducible Research and Data Management
  • Lecture: Why reproducibility matters in DH; introduction to R and RStudio; file organisation, project workflows, and version control basics
  • LADAL content: Reproducible Research and Creating R Notebooks
  • Additional Readings:
    Flanagan, J. (2025). Reproducibility, replicability, robustness, and generalizability in corpus linguistics. International Journal of Corpus Linguistics. https://doi.org/10.1075/ijcl.24113.fla
Week 3: Getting Started with R
  • Lecture: Introduction to R syntax, data types, vectors, and data frames; the tidyverse ecosystem; reading and writing data
  • LADAL content: Getting Started with R and Loading and Saving Data
  • Additional Readings:
    Wickham, H., & Grolemund, G. (2016). R for data science. Ch. 1–3. https://r4ds.had.co.nz
Week 4: Working with Text Data
  • Lecture: How text is represented computationally; encoding, tokenisation, and the document-feature matrix; from raw text to structured data
  • LADAL content: String Processing and Regular Expressions
  • Additional Readings:
    Jockers, M. L. (2014). Text analysis with R for students of literature. Springer, Ch. 1–3
Week 5: Building and Exploring Digital Corpora
  • Lecture: What is a corpus? Sampling principles, metadata, corpus design for humanities research; downloading and preparing digital texts
  • LADAL content: Downloading Texts from Project Gutenberg and Converting PDFs to Text
  • Additional Readings:
    Biber, D., Conrad, S., & Reppen, R. (1998). Corpus linguistics. Cambridge University Press, Ch. 1–2
Week 6: Frequency Analysis and Visualisation
  • Lecture: Zipf’s law and frequency distributions; word counts, type-token ratios, and dispersion; principles of effective visualisation for humanities data
  • LADAL content: Introduction to Data Visualisation and Descriptive Statistics
  • Additional Readings:
    Jockers (2014), Ch. 4–5
Week 7: Concordancing, Collocations, and Keywords
  • Lecture: Searching corpora; KWIC concordances and their interpretation; collocation and association measures; keyness and corpus comparison
  • LADAL content: Concordancing with R and Keyness Analysis
  • Additional Readings:
    Baker, P. (2006). Using corpora in discourse analysis. Continuum, Ch. 3–4
Week 8: Topic Modelling and Thematic Analysis
  • Lecture: Latent Dirichlet Allocation (LDA); interpreting topics; applications in literary and historical research; limitations and critical perspectives
  • LADAL content: Topic Modelling
  • Additional Readings:
    Blei, D. M. (2012). Probabilistic topic models. Communications of the ACM, 55(4), 77–84
    Maier, D., et al. (2021). Applying LDA topic modeling in communication research. In Computational methods for communication science (pp. 13–38). Routledge
Week 9: Sentiment Analysis and Opinion Mining
  • Lecture: Lexicon-based and machine learning approaches to sentiment; subjectivity, valence, and emotion; applications in literary and media studies
  • LADAL content: Sentiment Analysis
  • Additional Readings:
    Liu, B. (2012). Sentiment analysis and opinion mining. Ch. 1–2
Week 10: Network Analysis for Humanities Research
  • Lecture: Graphs and networks as representations of humanistic data; character networks in fiction; citation networks; social networks in historical sources; centrality and community detection
  • LADAL content: Network Analysis
  • Additional Readings:
    Moretti, F. (2011). Network theory, plot analysis. New Left Review, 68, 80–102
Week 11: Maps, Space, and Geographic Visualisation
  • Lecture: Spatial thinking in digital humanities; mapping literary geography, dialect distribution, and historical events; choropleth maps and point maps in R
  • LADAL content: Maps and Spatial Visualisation
  • Additional Readings:
    Drucker (2021), Ch. 8
Week 12: Project Workshop and Critical Reflections
  • Lecture: Critical DH — bias in corpora and algorithms, data ethics, representation, and positionality; communicating DH research; the future of digital humanities
  • Tutorial: Student project presentations and peer feedback
Core Readings
  • Baker, P. (2006). Using corpora in discourse analysis. Continuum.
  • Biber, D., Conrad, S., & Reppen, R. (1998). Corpus linguistics: Investigating language structure and use. Cambridge University Press.
  • Burdick, A., Drucker, J., Lunenfeld, P., Presner, T., & Schnapp, J. (2012). Digital humanities. MIT Press.
  • Drucker, J. (2021). The digital humanities coursebook. Routledge.
  • Flanagan, J. (2025). Reproducibility, replicability, robustness, and generalizability in corpus linguistics. International Journal of Corpus Linguistics. https://doi.org/10.1075/ijcl.24113.fla
  • Jockers, M. L. (2014). Text analysis with R for students of literature. Springer.
  • Liu, B. (2012). Sentiment analysis and opinion mining. Synthesis Lectures on Human Language Technologies, 5(1), 1–167.
  • Maier, D., et al. (2021). Applying LDA topic modeling in communication research. In Computational methods for communication science (pp. 13–38). Routledge.
  • Wickham, H., & Grolemund, G. (2016). R for data science. O’Reilly. https://r4ds.had.co.nz

Introduction to Corpus Linguistics and Text Analysis with R

Duration: 12 weeks (1h lecture + 1.5h tutorial per week)
Audience: Students in linguistics, applied linguistics, translation, communication, and literary studies
Assumed background: None
Aim: Introduce corpus-based methods for linguistic analysis and hands-on text analysis with R, from corpus construction through to sentiment analysis, topic modelling, and network analysis

Week 1: Introduction to Corpus Linguistics and Text Analytics
  • Lecture: What is corpus linguistics? Key concepts, history, and applications; corpus vs. introspective and experimental methods; overview of the course
  • LADAL content: Introduction to Text Analysis
  • Additional Readings:
    McEnery, T., & Hardie, A. (2012). Corpus linguistics: Method, theory and practice. Cambridge University Press, Ch. 1–2
Week 2: Working with Digital Data and Reproducibility
  • Lecture: Principles of reproducible research; introduction to R Notebooks; file management and workflow
  • LADAL content: Reproducible Research and Creating R Notebooks
  • Additional Readings:
    Flanagan, J. (2025). Reproducibility, replicability, robustness, and generalizability in corpus linguistics. International Journal of Corpus Linguistics. https://doi.org/10.1075/ijcl.24113.fla
Week 3: Getting Started with R
  • Lecture: Introduction to R syntax, data types, vectors, and data frames; the tidyverse ecosystem; reading and writing data
  • LADAL content: Getting Started with R and Loading and Saving Data
  • Additional Readings:
    Wickham, H., & Grolemund, G. (2016). R for data science. Ch. 1–3. https://r4ds.had.co.nz
Week 4: Corpus Compilation and Preparation
  • Lecture: Types of corpora; sampling principles and representativeness; metadata and annotation; legal and ethical issues in corpus construction
  • LADAL content: Downloading Texts from Project Gutenberg
  • Additional Readings:
    Biber, D., Conrad, S., & Reppen, R. (1998). Corpus linguistics. Cambridge University Press, Ch. 1–2
Week 5: Frequency and Dispersion
  • Lecture: Counting words and n-grams; Zipf’s law; normalised frequencies; dispersion measures and why they matter; type-token ratio
  • LADAL content: Handling Tables in R
  • Additional Readings:
    McEnery & Hardie (2012), Ch. 3
    Gries, S. T. (2024). Frequency, dispersion, association, and keyness. Ch. 1–2
Week 6: Concordancing and KWIC
  • Lecture: Searching corpora; concordance displays and their interpretation; sorting and filtering; from examples to patterns
  • LADAL content: Concordancing with R
  • Additional Readings:
    Baker, P. (2006). Using corpora in discourse analysis. Ch. 3
Week 7: Collocations and N-grams
  • Lecture: Association measures (MI, t-score, log-likelihood, Dice); phraseology and formulaic sequences; n-gram extraction and analysis
  • LADAL content: Collocation and N-gram Analysis
  • Additional Readings:
    Gries (2024), Ch. 2
Week 8: Keywords and Keyness
  • Lecture: Reference corpora and keyness; log-likelihood and log ratio as keyness measures; interpretation and applications in discourse analysis
  • LADAL content: Keyness and Keyword Analysis
  • Additional Readings:
    Gries (2024), Ch. 3
Week 9: Advanced Text Analytics I — Topic Modelling
  • Lecture: Unsupervised text classification; LDA and its assumptions; interpreting and validating topic models; applications in linguistics and discourse analysis
  • LADAL content: Topic Modelling
  • Additional Readings:
    Maier, D., et al. (2021). Applying LDA topic modeling in communication research (pp. 13–38)
Week 10: Advanced Text Analytics II — Sentiment and Network Analysis
  • Lecture: Sentiment lexicons; opinion mining; co-occurrence networks and semantic networks from corpus data
  • LADAL content: Sentiment Analysis and Network Analysis
  • Additional Readings:
    Liu (2012), Ch. 1–2
Week 11: Case Studies in Corpus Linguistics
  • Lecture: Corpus-based studies of grammar, lexis, and discourse; from method to interpretation; writing up corpus research
  • LADAL content: Corpus Linguistics with R
  • Additional Readings:
    Baker (2006), Ch. 7
Week 12: Project Workshop and Presentations
  • Lecture: Ethics in corpus research; future directions; communicating corpus findings to non-specialist audiences
  • Tutorial: Student project work
Core Readings
  • Baker, P. (2006). Using corpora in discourse analysis. Continuum.
  • Biber, D., Conrad, S., & Reppen, R. (1998). Corpus linguistics: Investigating language structure and use. Cambridge University Press.
  • Flanagan, J. (2025). Reproducibility in corpus linguistics. International Journal of Corpus Linguistics. https://doi.org/10.1075/ijcl.24113.fla
  • Gries, S. T. (2024). Frequency, dispersion, association, and keyness (Studies in Corpus Linguistics, Vol. 115). John Benjamins.
  • Liu, B. (2012). Sentiment analysis and opinion mining. Synthesis Lectures on Human Language Technologies, 5(1), 1–167.
  • Maier, D., et al. (2021). Applying LDA topic modeling in communication research. In Computational methods for communication science (pp. 13–38). Routledge.
  • McEnery, T., & Hardie, A. (2012). Corpus linguistics: Method, theory and practice. Cambridge University Press.
  • Wickham, H., & Grolemund, G. (2016). R for data science. O’Reilly. https://r4ds.had.co.nz

Introduction to Statistics in the Humanities and Social Sciences

Duration: 12 weeks (1h lecture + 1.5h tutorial per week)
Audience: Students and researchers in linguistics, psychology, education, sociology, and related fields
Assumed background: None
Aim: Provide a practical and conceptual foundation in quantitative methods, from probability and descriptive statistics through regression and mixed-effects modelling, using R throughout

Week 1: Introduction to Quantitative Research
  • Lecture: The role of quantitative methods in humanities and social sciences; an overview of statistical thinking; the research cycle; types of research questions
  • LADAL content: Introduction to Quantitative Reasoning
  • Additional Readings:
    Field, Miles & Field (2012), Ch. 1
    Baayen (2008), Ch. 1
Week 2: Basic Concepts in Quantitative Research
  • Lecture: Data types and measurement scales; variables, operationalisation, and construct validity; sampling and representativeness; reliability and validity
  • LADAL content: Basic Concepts in Quantitative Research
  • Additional Readings:
    Gries (2013), Ch. 1–2
Week 3: Getting Started with R — Part 1
  • Lecture: Introduction to R and RStudio; installing and loading packages; basic syntax and data structures; the tidyverse ecosystem
  • LADAL content: Getting Started with R
  • Additional Readings:
    Wickham & Grolemund (2016), Ch. 1–3
Week 4: Getting Started with R — Part 2: Loading and Handling Data
  • Lecture: Importing datasets from CSV, Excel, and text files; data cleaning and transformation; working with factors and missing values
  • LADAL content: Loading and Saving Data and Handling Tables in R
  • Additional Readings:
    Baayen (2008), Ch. 2
Week 5: R Basics for Statistical Analysis
  • Lecture: Vectors, factors, data frames, indexing, and subsetting; writing functions; applying operations across groups with dplyr
  • LADAL content: Getting Started with R — advanced sections
  • Additional Readings:
    Field, Miles & Field (2012), Ch. 2–3
Week 6: Descriptive Statistics
  • Lecture: Measures of central tendency and dispersion; frequency distributions; skewness and kurtosis; the normal distribution; summarising grouped data
  • LADAL content: Descriptive Statistics
  • Additional Readings:
    Baayen (2008), Ch. 3
    Winter (2019), Ch. 2
Week 7: Visualising Data
  • Lecture: Principles of effective visualisation; choosing the right graph type; histograms, box plots, scatter plots, and bar charts; ggplot2 grammar of graphics
  • LADAL content: Data Visualisation with R
  • Additional Readings:
    Wickham & Grolemund (2016), Ch. 14
Week 8: Hypothesis Testing and Power Analysis
  • Lecture: The logic of null hypothesis significance testing; t-tests, ANOVA; p-values and their interpretation; effect sizes; statistical power and sample size planning
  • LADAL content: Basic Inferential Statistics
  • Additional Readings:
    Field, Miles & Field (2012), Ch. 4
    Gries (2013), Ch. 3
Week 9: Correlation and Simple Regression
  • Lecture: Pearson and Spearman correlation; simple linear regression; interpreting intercepts and slopes; assumptions and diagnostics
  • LADAL content: Regression Analysis
  • Additional Readings:
    Baayen (2008), Ch. 4
Week 10: Multiple Regression and Model Diagnostics
  • Lecture: Multiple regression; multicollinearity; residual analysis; model comparison with AIC/BIC; stepwise and theory-driven model building
  • LADAL content: Regression Analysis — advanced sections
  • Additional Readings:
    Winter (2019), Ch. 5
Week 11: Logistic Regression
  • Lecture: Binary and ordinal outcomes; logistic regression model fitting and interpretation; odds ratios and predicted probabilities; the proportional odds model
  • LADAL content: Regression Analysis and Visualising and Analysing Survey Data
  • Additional Readings:
    Baayen (2008), Ch. 5
    Winter (2019), Ch. 6
Week 12: Mixed-Effects Models
  • Lecture: Why mixed effects? Random intercepts and random slopes; by-participant and by-item random effects; fitting and interpreting mixed models with lme4
  • LADAL content: Mixed-Effects Models
  • Tutorial: Student mini-projects applying learned methods to real datasets
  • Additional Readings:
    Gries (2013), Ch. 6
    Field, Miles & Field (2012), Ch. 12
Core Readings
  • Baayen, R. H. (2008). Analyzing linguistic data. Cambridge University Press.
  • Field, A., Miles, J., & Field, Z. (2012). Discovering statistics using R. Sage.
  • Gries, S. T. (2013). Statistics for linguistics with R: A practical introduction (2nd ed.). De Gruyter Mouton.
  • Wickham, H., & Grolemund, G. (2016). R for data science. O’Reilly. https://r4ds.had.co.nz
  • Winter, B. (2019). Statistics for linguists: An introduction using R. Routledge.

Advanced Statistics in the Humanities and Social Sciences

Duration: 12 weeks (1h lecture + 1.5h tutorial per week)
Audience: Students and researchers with prior knowledge of basic statistics who wish to apply advanced quantitative methods
Assumed background: Basic statistics (t-tests, regression) and intermediate R skills
Aim: Develop advanced skills in multivariate modelling, classification, clustering, and survey data analysis using R

Week 1: Advanced Data Management and Reproducible Workflows
  • Lecture: Organising complex datasets; reproducibility in advanced research; scripting and automating analysis pipelines; version control with Git
  • LADAL content: Reproducible Research and Creating R Notebooks
  • Additional Readings: Flanagan (2025)
Week 2: Review of Descriptive and Inferential Statistics
  • Lecture: Quick review of key concepts: distributions, t-tests, correlations, confidence intervals, effect sizes, and power
  • LADAL content: Descriptive Statistics and Basic Inferential Statistics
  • Additional Readings: Field, Miles & Field (2012), Ch. 1–4
Week 3: Advanced Regression — Multiple and Hierarchical Models
  • Lecture: Multiple regression; interaction terms; hierarchical (nested) models; mixed-effects models with random intercepts and slopes
  • LADAL content: Regression Analysis and Mixed-Effects Models
  • Additional Readings: Baayen (2008), Ch. 4–5; Winter (2019), Ch. 5–6
Week 4: Logistic Regression and Generalised Linear Models
  • Lecture: Binary and multinomial outcomes; model fitting and interpretation; goodness-of-fit; GLMs as a unified framework
  • LADAL content: Regression Analysis
  • Additional Readings: Winter (2019), Ch. 6
Week 5: Classification — Decision Trees
  • Lecture: Decision trees; recursive partitioning; overfitting and pruning; interpreting tree outputs; applications in linguistic classification problems
  • LADAL content: Tree-Based Models
  • Additional Readings: Gries (2013), Ch. 6
Week 6: Classification — Random Forests and Ensemble Methods
  • Lecture: Ensemble learning; bagging and boosting; random forests; variable importance; improving prediction accuracy and generalisability
  • LADAL content: Tree-Based Models
  • Additional Readings: James, Witten, Hastie & Tibshirani (2021), Ch. 8
Week 7: Clustering and Correspondence Analysis
  • Lecture: Unsupervised classification; k-means and hierarchical clustering; choosing the number of clusters; correspondence analysis for categorical data
  • LADAL content: Cluster and Correspondence Analysis
  • Additional Readings: Gries (2013), Ch. 7
Week 8: Survey and Questionnaire Data Analysis I
  • Lecture: Preparing survey data; dealing with missing values; Likert scales and their properties; descriptive analysis and visualisation of survey items
  • LADAL content: Visualising and Analysing Survey Data
  • Additional Readings: Field, Miles & Field (2012), Ch. 10; Baayen (2008), Ch. 6
Week 9: Survey and Questionnaire Data Analysis II
  • Lecture: Reliability (Cronbach’s α, McDonald’s ω); factor analysis and scale validation; cross-tabulations and chi-square tests; ordinal regression for Likert outcomes
  • LADAL content: Visualising and Analysing Survey Data
  • Additional Readings: Field, Miles & Field (2012), Ch. 11
Week 10: Dimension Reduction and Multivariate Techniques
  • Lecture: Principal Component Analysis (PCA); multidimensional scaling (MDS); detecting latent variables; applications to linguistic and social science data
  • LADAL content: Dimension Reduction Methods
  • Additional Readings: Gries (2013), Ch. 8
Week 11: Model Evaluation, Diagnostics, and Advanced Visualisation
  • Lecture: Residual analysis and outlier detection; model comparison and selection criteria (AIC, BIC, cross-validation); visualisation techniques for multivariate data
  • LADAL content: Data Visualisation with R and Regression Analysis
  • Additional Readings: Winter (2019), Ch. 7
Week 12: Applications and Student Mini-Projects
  • Lecture: Integrating advanced methods into humanities and social science research; ethical considerations; communicating complex statistical results; reproducibility revisited
  • Tutorial: Student project work applying classification, clustering, and survey analysis to real datasets
  • Additional Readings: Baayen (2008), Ch. 7; Field, Miles & Field (2012), Ch. 12
Core Readings
  • Baayen, R. H. (2008). Analyzing linguistic data. Cambridge University Press.
  • Field, A., Miles, J., & Field, Z. (2012). Discovering statistics using R. Sage.
  • Flanagan, J. (2025). Reproducibility in corpus linguistics. International Journal of Corpus Linguistics. https://doi.org/10.1075/ijcl.24113.fla
  • Gries, S. T. (2013). Statistics for linguistics with R: A practical introduction (2nd ed.). De Gruyter Mouton.
  • James, G., Witten, D., Hastie, T., & Tibshirani, R. (2021). An introduction to statistical learning: With applications in R (2nd ed.). Springer.
  • Winter, B. (2019). Statistics for linguists: An introduction using R. Routledge.
